Welcome to the second part of the tutorial, where we are going to have a look at another popular social media platform: YouTube.

Packages to install

library(knitr)
#library(magick)
library(png)
library(tuber)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## Warning: package 'dplyr' was built under R version 3.5.1
## ── Conflicts ─────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(tidytext)

library(grid)
#library(emo)
library(icon)
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:icon':
## 
##     fa
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

options(stringsAsFactors = FALSE) 

Reference material for the following tutorial

Downloading YouTube data with tuber

Once you set up your YouTube access via the Google Developer Console, you can connect to the API and download data.

A reminder: to connect you use yt_oauth().

app_id = "667235664106-a5e6ng7pna0ptv7qoqnrsn0ldm1i68a8.apps.googleusercontent.com" 
app_secret = "8UV8kvt8cZ75EXg-ubOPBjDJ"

#connect
yt_oauth(app_id=app_id, app_secret=app_secret, token='', cache=FALSE)

By default the function looks for .httr-oauth in the working directory, in case you have connected before. If it doesn’t find it, it uses the application ID and secret that you pass to it. If you do not want the function to use the cache, set cache=FALSE.

The function launches a browser so that you can authorize the application.

Once you are connected, we can start searching!

results <- yt_search("World Cup 2018")
kable(results[1:4,1:5])
|video_id    |publishedAt              |channelId                |title |description |
|:-----------|:------------------------|:------------------------|:-----|:------------|
|qlZaiBuOaz4 |2017-09-09T19:17:50.000Z |UC07CHp5ikd-AyafR4J2LWkA |No T20 World Cup In 2018, Australia Will host Next WT20 in 2020 |It is now officially confirmed by ICC that the 2018 season of ICC WT20 Will not gonna happen,Australia now awarded as the next destination of icc t20 world cup … |
|9BekP8KQBtY |2018-07-10T19:28:29.000Z |UCH0lge-0lFszfPRqFkaoXpw |Francia Vs Belgio - World CUP 2018 Semifinale “Calci di Rigore” \| PES 2018 Patch |[Giù] PES 2018 MiniFaces Pack World Cup Teams http://freedownloadtn.altervista.org/pes%202018%20minifaces%20pack%20world%20cup%20teams.php |
|s-AYocn9ulU |2018-07-10T14:00:08.000Z |UCH0lge-0lFszfPRqFkaoXpw |Juventus (CR7⭐) Vs Inter Classic (Adriano) impressionante bomba di Adriano \| PES 2018 Patch |[Giù] PES 2018 MiniFaces Pack World Cup Teams http://freedownloadtn.altervista.org/pes%202018%20minifaces%20pack%20world%20cup%20teams.php |
|aWxbphbmDts |2018-01-22T15:48:13.000Z |UCqMp3kP5EP0YXANgY-I43JA |T20 World Cup 2018 Full Schedule And Time Table ICC World 20 20 Cup |ICC T20 World Cup 2018 Full Schedule And Time Table in urdu hindi ICC t20 World Cup 2018 ICC t20 World Cup full time table Hind Pakistan Vs India T20 … |

The yt_search function returns a data.frame with 16 elements per video, including video_id, publishedAt, channelId, title and description.

The yt_search function also takes parameters that let you specify your search. The most popular ones are term (the search query), type (e.g. "video") and channel_id, which restricts the search to a single channel (we will use it later).
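
If you want a more targeted search, a small sketch is below; max_results and published_after are taken from the tuber documentation, so check ?yt_search for the exact parameter names in your version:

results_recent <- yt_search(
  term = "World Cup 2018",
  type = "video",
  max_results = 50,                           # cap the number of results returned
  published_after = "2018-06-01T00:00:00Z"    # RFC 3339 timestamp
)
names(results_recent)   # inspect all 16 columns returned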

Now let’s have a look at the comments of a video. We can pick any video from our search and retrieve its comments using its video_id:

results <- get_comment_threads(c(video_id="qlZaiBuOaz4"))
kable(results[1:10, c("authorDisplayName", "textDisplay")])
|authorDisplayName   |textDisplay |
|:-------------------|:-----------|
|Suraj Yadav         |Sale jhooth bolta hai faltu channal ko subscribe nhi kiya jata |
|Meena Srivastava    |Nice video bro |
|bdayal singh        |My favorite tournament is T20 world cup |
|DHRUV SAWANT        |2020 me hoga |
|Nikhil Yadav        |GU hai TU guuu |
|Vandana2 Rajpurohit |yash bhai app tho ipl ke hi video banate ho aap world cricket ke upar bhi video banao |
|Arjun Bhandari      |Ipl k bareme batao jiii rcb k bareme kya change hoga |
|Vikas Mathur        |Bhosdee ke tuje bada pata h |
|Munmun Samanta      |chip Kar faltu |
|Mohammed Rafique    |Asia cup kab hoga bhai |

You specify your video id in the filter argument. You can also use this argument to download comments for a specific channel. So, let’s have a look at your favourite YouTube channel. Which one is your favourite? Mine is Bloomberg.

The channel id for Bloomberg is UCUMZ7gohGI9HcU9VNsr2FJQ. We are going to use a more generic function, list_channel_resources, which returns a list of requested channel resources.

a <- list_channel_resources(filter = c(channel_id = "UCUMZ7gohGI9HcU9VNsr2FJQ"), part="contentDetails")

# Uploaded playlists:
playlist_id <- a$items[[1]]$contentDetails$relatedPlaylists$uploads

# Get videos on the playlist
vids <- get_playlist_items(filter= c(playlist_id=playlist_id)) 

# Video ids
vid_ids <- as.vector(vids$contentDetails.videoId)

# Function to scrape stats for all vids
get_all_stats <- function(id) {
  get_stats(id)
} 

# Get stats and convert results to data frame 
res <- lapply(vid_ids, get_all_stats)

res_df <- rbindlist(lapply(res, data.frame), fill=TRUE)

head(res_df)
##             id viewCount likeCount dislikeCount favoriteCount commentCount
## 1: j8x7TIm4rK4      5521       320           64             0          181
## 2: 5afmFgfdFzA     82051       886           71             0          173
## 3: 1zDZmIDbDO0     38765      1344           76             0          167
## 4: orR8FSlKGo0     81144      1131          183             0          322
## 5: -pU3lfz2xN8     44859      1180           58             0          155
## 6: l9RWTMNnvi4     51940      1933           23             0          111

But… I guess it’s getting too busy for the day, so let’s take a break and have a look at Victoria’s Secret.

It’s Brisbane, Australia after all!

To locate the channel we need to use channel id, not its name.

Get the user/channel id

Obtaining data from a YouTube channel (and having a look at its stats) is usually done through channel_id, which is not the same as the YouTube name you see in the channel’s link.

There are several ways to obtain the channel_id for a YouTube name:

  • Use the YouTube Data API: https://www.googleapis.com/youtube/v3/channels?key={YOUR_API_KEY}&forUsername={USER_NAME}&part=id, where YOUR_API_KEY is the key you create in the Google Developer Console and USER_NAME is the YouTube channel (username). A sketch of calling this endpoint from R is shown after this list.

  • Use the page source code: open the YouTube page in the browser and view the page source. Search for either externalId or data-channel-external-id; the value there is the channel id.
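
If you prefer to stay in R rather than the browser, here is a minimal sketch of calling the same endpoint with httr and jsonlite (YOUR_API_KEY and USER_NAME are placeholders, as above):

library(httr)
library(jsonlite)

api_key  <- "YOUR_API_KEY"   # placeholder: key from the Google Developer Console
username <- "USER_NAME"      # placeholder: the YouTube username

resp <- GET("https://www.googleapis.com/youtube/v3/channels",
            query = list(key = api_key, forUsername = username, part = "id"))
fromJSON(content(resp, as = "text"))$items$id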

Downloading channel stats

Now that we have the channel id, let’s get a list of its videos and have a look at its stats.

channel_id<-"UChWXY0e-HUhoXZZ_2GlvojQ"
videosVS = yt_search(term="", type="video", channel_id = channel_id)
kable(videosVS[1:4,1:5])
|video_id    |publishedAt              |channelId                |title |description |
|:-----------|:------------------------|:------------------------|:-----|:------------|
|HLo_3GfHjps |2012-11-30T17:10:00.000Z |UChWXY0e-HUhoXZZ_2GlvojQ |Miranda Kerr and Bruno Mars Backstage at the 2012 Victoria’s Secret Fashion Show |Get ready for dimples, winks and giggles when Miranda Kerr quizzes Bruno Mars backstage at the 2012 Victoria’s Secret Fashion Show. Catch them both on the … |
|76gf3YeGf1s |2010-11-30T17:49:42.000Z |UChWXY0e-HUhoXZZ_2GlvojQ |Before I Was a Supermodel: Behati Prinsloo |On the eve of the 2010 Fashion Show, Supermodel Behati Prinsloo shares some memories about growing up and becoming a model. |
|wMYYQkwLDqM |2013-10-17T15:27:54.000Z |UChWXY0e-HUhoXZZ_2GlvojQ |Candice Swanepoel Meets the Royal Fantasy Bra |Go on set as Victoria’s Secret Angel Candice Swanepoel gets to see and try on the Royal Fantasy Bra for the very first time. Look for Candice and the $10 million … |
|anhEcKWRzyE |2017-10-16T17:06:39.000Z |UChWXY0e-HUhoXZZ_2GlvojQ |Road to the Runway: Episode 1 – Castings |The 2017 Victoria’s Secret Fashion Show magic kicks off with castings. Go behind the scenes as this year’s models make their way to wearing those … |
#get channel stats
statsVS<-get_channel_stats(channel_id=channel_id)
## Channel Title: Victoria's Secret 
## No. of Views: 266052763 
## No. of Subscribers: 1505487 
## No. of Videos: 743
statsVSSelected <- as.vector(statsVS$statistics)
results<-do.call(rbind, statsVSSelected)
head(results)
##                       [,1]       
## viewCount             "266052763"
## commentCount          "0"        
## subscriberCount       "1505487"  
## hiddenSubscriberCount "FALSE"    
## videoCount            "743"

The get_channel_stats function is quite straightforward. It takes channel_id as an argument and returns a nested list. We can select the items we need from the list and convert them to a data.frame.

Downloading stats for videos

Now that we have a list of videos from the VS channel, let’s download stats for each video. As an example, let’s do the first 10.

videosVS_sample<-videosVS[1:10,]

videoStatsVS = lapply(as.character(videosVS_sample$video_id), function(x){
  get_stats(video_id = x)
})
videoStatsVS_df = do.call(rbind.data.frame, videoStatsVS)
head(videoStatsVS_df)
##             id viewCount likeCount dislikeCount favoriteCount commentCount
## 2  HLo_3GfHjps   2265269     16365          277             0         1826
## 21 76gf3YeGf1s    886327      5478           72             0          299
## 3  wMYYQkwLDqM   2491692     17840          299             0         1306
## 4  anhEcKWRzyE   2173214     37751          572             0         1672
## 5  DkcctlhyWEs    954716      9375          202             0          402
## 6  LwelJNogl-M    883488      7721           88             0          200

The function uses video ids, but does not return video titles and dates, which we can add ourselves before doing some clean-up.

videoStatsVS_df$title = videosVS_sample$title
videoStatsVS_df$date = videosVS_sample$date
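# note: in the videosVS table the publication date sits in the publishedAt column,
# so use videosVS_sample$publishedAt if your search results have no date column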

library(tidyverse)
videoStatsVS_df = as.tibble(videoStatsVS_df) %>%
  mutate(viewCount = as.numeric(as.character(viewCount)), #originally as factor
         likeCount = as.numeric(as.character(likeCount)),
         dislikeCount = as.numeric(as.character(dislikeCount)),
         commentCount = as.numeric(as.character(commentCount)))

head(videoStatsVS_df)
## # A tibble: 6 x 7
##   id     viewCount likeCount dislikeCount favoriteCount commentCount title
##   <chr>      <dbl>     <dbl>        <dbl> <chr>                <dbl> <chr>
## 1 HLo_3…   2265269     16365          277 0                     1826 Mira…
## 2 76gf3…    886327      5478           72 0                      299 Befo…
## 3 wMYYQ…   2491692     17840          299 0                     1306 Cand…
## 4 anhEc…   2173214     37751          572 0                     1672 Road…
## 5 Dkcct…    954716      9375          202 0                      402 Rome…
## 6 LwelJ…    883488      7721           88 0                      200 Vict…

I ran the function with the full list of video ids for the channel; you can use the resulting file, videoStatsVS_df.csv, under the YouTube tutorial-data folder. Let’s load it.

videoStatsVS_df <- as.data.table(read.csv("YouTube tutorial-data/videoStatsVS_df.csv", stringsAsFactors=FALSE))
head(videoStatsVS_df)
##    X          id viewCount likeCount dislikeCount favoriteCount
## 1: 1 HLo_3GfHjps   2259491     16317          277             0
## 2: 2 76gf3YeGf1s    884766      5463           72             0
## 3: 3 wMYYQkwLDqM   2490140     17827          299             0
## 4: 4 DkcctlhyWEs    953821      9353          202             0
## 5: 5 anhEcKWRzyE   2165304     37656          569             0
## 6: 6 LwelJNogl-M    883017      7711           88             0
##    commentCount
## 1:         1826
## 2:          299
## 3:         1305
## 4:          402
## 5:         1673
## 6:          200
##                                                                               title
## 1: Miranda Kerr and Bruno Mars Backstage at the 2012 Victoria's Secret Fashion Show
## 2:                                       Before I Was a Supermodel: Behati Prinsloo
## 3:                                    Candice Swanepoel Meets the Royal Fantasy Bra
## 4:                               Romee Strijd on Becoming a Victoria’s Secret Angel
## 5:                                         Road to the Runway: Episode 1 – Castings
## 6:                                            Victoria’s Secret Angel Outtakes 2014

Let’s see which video was most popular:

The videos with the most views are:

videoStatsVS_df %>% arrange_(~ desc(viewCount)) %>%
  top_n(n = 5) %>% 
  select(title, viewCount, likeCount, favoriteCount, commentCount, id)
##                                    title viewCount likeCount favoriteCount
## 1 Why Alessandra Ambrosio loves her body    773575      3250             0
## 2                        You've Got Male    422072       881             0
## 3    Wild & Beautiful: Behind the Scenes    338447      1115             0
## 4   Why Lindsay Ellingson loves her body    178581       320             0
## 5    Why Emanuela DePaula loves her body    118804       339             0
##   commentCount          id
## 1           23 EMOxRWd-Q1k
## 2           19 drxva8UM-zM
## 3           17 nmTT42Q0bsI
## 4           17 B3QSmqcTojQ
## 5           11 jANpsZGKPXA

Let’s have a look at it!

The videos with the most likes are:

videoStatsVS_df %>% arrange_(~ desc(likeCount)) %>%
  top_n(n = 5) %>% 
  select(title, likeCount, viewCount, favoriteCount, commentCount, id)
##                                    title likeCount viewCount favoriteCount
## 1 Why Alessandra Ambrosio loves her body      3250    773575             0
## 2    Wild & Beautiful: Behind the Scenes      1115    338447             0
## 3                        You've Got Male       881    422072             0
## 4    Why Emanuela DePaula loves her body       339    118804             0
## 5   Why Lindsay Ellingson loves her body       320    178581             0
##   commentCount          id
## 1           23 EMOxRWd-Q1k
## 2           17 nmTT42Q0bsI
## 3           19 drxva8UM-zM
## 4           11 jANpsZGKPXA
## 5           17 B3QSmqcTojQ

The videos with the most comments are:

videoStatsVS_df %>% arrange_(~ desc(commentCount)) %>%
  top_n(n = 5) %>% 
  select(title, commentCount, viewCount, likeCount, favoriteCount, id)
##                                    title commentCount viewCount likeCount
## 1 Why Alessandra Ambrosio loves her body           23    773575      3250
## 2                        You've Got Male           19    422072       881
## 3    Wild & Beautiful: Behind the Scenes           17    338447      1115
## 4   Why Lindsay Ellingson loves her body           17    178581       320
## 5    Why Emanuela DePaula loves her body           11    118804       339
##   favoriteCount          id
## 1             0 EMOxRWd-Q1k
## 2             0 drxva8UM-zM
## 3             0 nmTT42Q0bsI
## 4             0 B3QSmqcTojQ
## 5             0 jANpsZGKPXA

Downloading titles and comments

Let’s have a look at titles now and see what we can find there.

Continuing with the “girl power” theme, let’s compare VS to another powerhouse, US Vogue.

Following the procedure described earlier, I downloaded video stats for both the VS and US Vogue channels and merged them into one file, videostats_All.csv.

Further analysis will involve manipulating text, so we will need the tidyverse and tidytext packages.

library(tidyverse)
library(tidytext)

You can either download the channel stats yourself (see above; a sketch of one way to combine the two channels is shown after the channel ids below) or use videostats_All.csv.

Channel ids are:

  • Americanvogue = UCRXiA3h1no_PFkb1JCP0yMA
  • VICTORIASSECRET = UChWXY0e-HUhoXZZ_2GlvojQ
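
For reference, here is a sketch of one way the combined file could be built. It simply reuses the yt_search/get_stats loop shown earlier for each channel and adds a source label; it is not the exact code behind videostats_All.csv.

channels <- c(Vogue = "UCRXiA3h1no_PFkb1JCP0yMA", VS = "UChWXY0e-HUhoXZZ_2GlvojQ")

videostats_sketch <- rbindlist(lapply(names(channels), function(src) {
  vids  <- yt_search(term = "", type = "video", channel_id = channels[[src]])
  stats <- rbindlist(lapply(as.character(vids$video_id),
                            function(id) data.frame(get_stats(id))), fill = TRUE)
  stats$title  <- vids$title   # assumes one stats row per video, in the same order
  stats$source <- src
  stats
}), fill = TRUE)

# write.csv(videostats_sketch, "YouTube tutorial-data/videostats_All.csv", row.names = FALSE)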

To load the existing file let’s do this

videostats_All <- as.data.table(read.csv("YouTube tutorial-data/videostats_All.csv", stringsAsFactors=FALSE))
head(videostats_All)
##          date
## 1: 2016-01-05
## 2: 2016-01-18
## 3: 2016-01-19
## 4: 2016-01-22
## 5: 2016-02-02
## 6: 2016-02-11
##                                                                    title
## 1:                   Inside the Brooklyn Home of Artist Mickalene Thomas
## 2:                      Watch Irene Kim’s Japanese Hot Springs Adventure
## 3:                                     73 Questions With Derek Zoolander
## 4: Watch Pop Star in the Making Zella Day Perform a Heartbreaking Ballad
## 5:                       Cleaning House With Organizing Guru Marie Kondo
## 6:      Here’s What It’s Like to Be Lucky Blue Smith During Fashion Week
##       video_id viewCount likeCount dislikeCount commentCount source
## 1: SMX60fh5u7o     29346      1304            7           27  Vogue
## 2: r-shJpCflvQ     59103      1206           29           38  Vogue
## 3: H4q0K561WXs   2395735     34455         1817         1754  Vogue
## 4: wg19p1cDSRE     58920      2104            9          101  Vogue
## 5: z3OXvQZe7g8    392529      4121          870          171  Vogue
## 6: LIKTr9vXRQI    141677      2853           52          131  Vogue

Let’s have a brief look at the data. We are going to use the stargazer package, which is fantastic for generating “academic”-looking results, and the describeBy function from the psych package, which generates statistics by a grouping variable. We will group the variables by channel.

library(stargazer)

stargazer(videostats_All[,.(viewCount, likeCount, commentCount, dislikeCount)], median=TRUE, digit= 1, type = "text")
## 
## ============================================================================================
## Statistic     N      Mean        St. Dev.     Min  Pctl(25)  Median    Pctl(75)      Max    
## --------------------------------------------------------------------------------------------
## viewCount    456 1,199,256.000 2,452,764.000 6,477 115,134  367,336.5 1,082,360.0 20,665,383
## likeCount    456  25,219.360    66,559.910    145   1,992    5,040.5    18,092     747,921  
## commentCount 455   1,150.145     3,401.707   4.000  85.500   202.000    707.500   31,101.000
## dislikeCount 456    727.618      2,297.890     1      33      88.5        425       29,639  
## --------------------------------------------------------------------------------------------
## 
## =
## 1
## -
library(psych)

results<-describeBy(videostats_All[, .(viewCount, likeCount, commentCount, dislikeCount)], 
          group=videostats_All$source, digits=1, mat=TRUE) 

results[,c(1:7, 10:11)]
##               item group1 vars   n      mean        sd median   min
## viewCount1       1  Vogue    1 307 1512729.9 2874266.3 478164  6477
## viewCount2       2     VS    1 149  553373.0  889076.4 208598 31747
## likeCount1       3  Vogue    2 307   35448.6   79096.3  10367   145
## likeCount2       4     VS    2 149    4143.1    4528.7   2468   722
## commentCount1    5  Vogue    3 306    1611.7    4067.6    359     4
## commentCount2    6     VS    3 149     202.3     235.9    111    17
## dislikeCount1    7  Vogue    4 307    1042.1    2745.7    211     1
## dislikeCount2    8     VS    4 149      79.7     136.8     43     5
##                    max
## viewCount1    20665383
## viewCount2     5348750
## likeCount1      747921
## likeCount2       37661
## commentCount1    31101
## commentCount2     1673
## dislikeCount1    29639
## dislikeCount2     1189

Just a reminder that:

  • mean: average
  • st. dev: a measure of variation in the data compared to the average
  • min and max: extreme values

It is likely that the number of views relates to the number of likes: the more people view a video, the more they “like” it. We can test this with a correlation, using the corr.test function from the same psych package.

results<-corr.test(videostats_All[, .(viewCount, likeCount, commentCount, dislikeCount)], use = "complete",method="pearson",adjust="holm",
          alpha=.05,ci=FALSE)
results$r
##              viewCount likeCount commentCount dislikeCount
## viewCount    1.0000000 0.8130140    0.8078345    0.8286073
## likeCount    0.8130140 1.0000000    0.9469651    0.6756837
## commentCount 0.8078345 0.9469651    1.0000000    0.7457451
## dislikeCount 0.8286073 0.6756837    0.7457451    1.0000000

Or we can plot it on a graph with the help of ggplot2 and gridExtra.

library(ggplot2)
library(gridExtra)
p1=ggplot(data = videostats_All[-1, ]) + geom_point(aes(x = viewCount, y = likeCount))
p2=ggplot(data = videostats_All[-1, ]) + geom_point(aes(x = viewCount, y = dislikeCount))
p3=ggplot(data = videostats_All[-1, ]) + geom_point(aes(x = viewCount, y = commentCount))
grid.arrange(p1, p2, p3, ncol = 2)

You just cannot help but LOVE Adriana Lima! But let’s move on…

Text analysis

As you see, the title column contains a title that describes the video. It is logical to assume that the title is the first thing to attract a viewer’s attention. Let’s have a closer look and see if we can identify specific words.

Let’s tokenize the titles, remove stop words, and calculate word frequencies. Here frequency is calculated as the number of times a particular word is used in titles divided by the total number of videos in the channel.

title_words_All_Source<-videostats_All %>%
  as.tibble() %>% 
  unnest_tokens(word, title) %>%
  anti_join(stop_words) %>%
  count(source, word, sort = TRUE) %>%
  left_join(videostats_All %>% 
              group_by(source) %>% 
              summarise(total = n())) %>%
  mutate(freq = n/total) 
## Joining, by = "word"
## Joining, by = "source"

I saved the output into the title_words_All_Source.csv file, so you can load it and continue working with a tokenized version of the text.

title_words_All_Source <- as.data.table(read.csv("YouTube tutorial-data/title_words_All_Source.csv", stringsAsFactors=FALSE))
head(title_words_All_Source)
##    source       word   n total      freq
## 1:  Vogue      vogue 248   315 0.7873016
## 2:     VS victoria’s 100   148 0.6756757
## 3:     VS     secret  98   148 0.6621622
## 4:  Vogue        met  84   315 0.2666667
## 5:  Vogue       gala  82   315 0.2603175
## 6:  Vogue     beauty  61   315 0.1936508

We can compare frequencies of the word usage between VS and Vogue:

frequencyVS_Vogue <- title_words_All_Source %>% 
  select(source, word, freq) %>% 
  spread(source, freq) %>%
  arrange(VS, Vogue)
head(frequencyVS_Vogue, 10)
##         word       Vogue          VS
## 1         60 0.003174603 0.006756757
## 2  backstage 0.003174603 0.006756757
## 3       butt 0.003174603 0.006756757
## 4      david 0.003174603 0.006756757
## 5   fittings 0.003174603 0.006756757
## 6       it’s 0.003174603 0.006756757
## 7     lima’s 0.003174603 0.006756757
## 8        lip 0.003174603 0.006756757
## 9      music 0.003174603 0.006756757
## 10  official 0.003174603 0.006756757

So that it makes more sense, let’s show word usage by building a wordcloud. This will allow us to see the most prominent words immediately.

words_VS <- title_words_All_Source %>% 
  filter(source == "VS") %>%
  select(word, n)

words_Vogue <- title_words_All_Source %>% 
  filter(source == "Vogue") %>%
  select(word, n)

library(wordcloud)      
wordcloud(words_VS$word, words_VS$n)

wordcloud(words_Vogue$word, words_Vogue$n)

library(wordcloud2)
wordcloud2(data=words_VS, size = 1,
color = "random-light",  shape = 'star')

Moving on further into the text analysis, let’s have a look at the sentiment.

Sentiment analysis is a tool that assesses the tone and the voice of a text. It uses pre-made dictionaries with words classified into categories, and different dictionaries use different categories. For example, the nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. We can get the list of words in each dictionary, and there are further options to adjust the list:

get_sentiments("afinn")  #assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
## # A tibble: 2,476 x 2
##    word       score
##    <chr>      <int>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,466 more rows
get_sentiments("bing") #categorizes words in a binary fashion into positive and negative categories. 
## # A tibble: 6,788 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faced     negative 
##  2 2-faces     negative 
##  3 a+          positive 
##  4 abnormal    negative 
##  5 abolish     negative 
##  6 abominable  negative 
##  7 abominably  negative 
##  8 abominate   negative 
##  9 abomination negative 
## 10 abort       negative 
## # ... with 6,778 more rows
get_sentiments("nrc") #categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows
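
As a small aside (not part of the analysis below), the afinn scores could be combined with the title word counts to get a rough average sentiment score per channel:

# sketch: mean AFINN score of title words per channel, weighted by how often each word occurs
title_words_All_Source %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(source) %>%
  summarise(mean_score = weighted.mean(score, n))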

We will plot the results, using the ggplot2 package.

library(ggplot2)
title_words_All_Source %>%
  inner_join(get_sentiments("nrc")) %>%
  group_by(source, sentiment) %>%
  summarize(n = n()) %>%
  ggplot(aes(x = source, y = n, fill = source)) + 
  geom_bar(stat = "identity", alpha = 0.8) + 
  facet_wrap(~ sentiment, ncol = 5) 

As a next step, we can have a look at comments for the different sources and compare word usage among their audiences. Downloading all comments takes time… You can do it on your own using the following code:

commentsVS = lapply(as.character(videosVS$video_id), function(x){
  get_comment_threads(c(video_id = x), max_results = 50)
})

Please note that if the number of comments is too high, you may get errors. To avoid this, limit max_results.
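
One way to make the loop more robust (a sketch, not from the original code) is to wrap each request in tryCatch, so a single problematic video does not stop the whole download:

commentsVS <- lapply(as.character(videosVS$video_id), function(x) {
  tryCatch(
    get_comment_threads(c(video_id = x), max_results = 50),
    error = function(e) NULL   # skip videos with disabled or oversized comment threads
  )
})
commentsVS <- commentsVS[!sapply(commentsVS, is.null)]   # drop the failed ones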

Or you can use the file I created for you. I used the VS (you should already have it!), US Vogue and British Vogue channels for the dataset comments_All.csv.

comments_All <- as.data.table(read.csv("YouTube tutorial-data/comments_All.csv", stringsAsFactors=FALSE))
head(comments_All)
##                                                                                                      comment
## 1: hey you guys!\xd0___ is this on january 7 or on july 1st (sorry for my english i am from the netherlands)
## 2:                                                                                   What happens in July 1?
## 3:                                                     Check out my model work at Facebook.com/journeytofame
## 4:                                                                                                first okno
## 5:                                                                                                first okno
## 6:                                                                                                       hey
##    source
## 1:     VS
## 2:     VS
## 3:     VS
## 4:     VS
## 5:     VS
## 6:     VS

As you can see, we need to clean the text. It does look messy…

replace_reg <- "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https"
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"

comments_words <- comments_All %>%
  as.tibble() %>% 
  # strip URLs and HTML entities using the replace_reg pattern defined above
  mutate(comment = str_replace_all(comment, replace_reg, "")) %>%
  unnest_tokens(word, comment, token = "regex", pattern = unnest_reg) %>%
  filter(!word %in% stop_words$word, str_detect(word, "[a-z]")) %>%
  count(source, word, sort = TRUE) 

Now let’s apply the sentiment lexicon:

library(ggplot2)
comments_words %>%
  inner_join(get_sentiments("bing")) %>%
  group_by(source, sentiment) %>%
  summarize(n = n()) %>%
  ggplot(aes(x = source, y = n, fill = source)) + 
  geom_bar(stat = "identity", alpha = 0.8) + 
  facet_wrap(~ sentiment, ncol = 5) 

Topic modelling

Let’s see what topics we can find in the description of the videos and compare them across three channels, US Vogue, British Vogue and VS. Does the description of the video affect the number of likes, views and dislikes?

For topic modelling we will turn to the topicmodels package. It does not work well with tidytext output, so we will use the tm package to prepare the data.

We will use the videosVS dataset we already have - it lists all videos from the channel along with their descriptions. I also downloaded the same data from US Vogue and British Vogue and saved everything to the description_All.csv file for you to use.

library(topicmodels)
library(tm)

description_All <- as.data.table(read.csv("YouTube tutorial-data/description_All.csv", stringsAsFactors=FALSE))
description_All$source<-as.factor(description_All$source)
head (description_All)
##                                                                                                                                                              description
## 1:         French model and girl of the moment Jeanne Damas kicks off the new Vogue Shops series with a trip to one of France's iconic pharmacies, where she reveals ...
## 2:   Gigi Hadid and Bella Hadid, cover stars of Vogue's March 2018 issue, take the sculpture challenge to find out just how well they know each other – but who will ...
## 3:                Always keep your heels, head and standards high.” Models Alecia Morais, Azza Slimene, Demy de Vries and Loane Normand demonstrate a number of Coco ...
## 4:   "I'm very grateful to be here, and that overpowers any of the negative elements that come with fame." Vogue wakes up with Ariana Grande, cover star of the July ...
## 5: What does Victoria Beckham carry inside her bag? In this episode of British Vogue's 'In The Bag,' our October cover star shows us the items she simply can't live ...
## 6:         British Vogue spends the day with Seven "Posh" Boys, keep watching to see what was in store for the day! THERE was a merry twinkle in the eye of those in ...
##       video_id  source
## 1: X69Wd3b9Y04 VogueUK
## 2: r5rYxGK_VtA VogueUK
## 3: kTg8gKfq8TI VogueUK
## 4: n2wIqXBz4os VogueUK
## 5: XKMZR5uXAvk VogueUK
## 6: FOMAARZ8pmo VogueUK

Topic modeling is a sophisticated technique that is used to classify text and identify themes.

The most promising and widely used algorithm at this time is Latent Dirichlet Allocation (LDA). The beauty of the algorithm is that it allows overlap between topics (= themes). It assumes that each document is a collection of topics, and these topics can also be shared among several documents. How do we identify which topics are covered in a document?

We look at the document and identify specific words in it that relate to a particular topic. How do we do this? The algorithm starts by randomly assigning words to topics.

Do we know in advance how many topics there are? No. We just pick a number and experiment with different numbers of topics/documents/words again and again to see what works best for our text. The whole process is based on assessing probabilities: the probability of a topic in a document and the probability of a word in a topic. The process runs iteratively until it reaches a stable state. Each document is then assigned topic(s) based on the proportion of its words assigned to each topic, and words are assigned to topics based on the proportion of words already assigned to that topic.

The math side of the algorithm is quite heavy. You are indeed welcome to dig out the original work of Blei et al. 2003, “Latent Dirichlet Allocation”, but I would suggest that we try to do some practical things to get you going!

So. The tm package.

It has a different philosophy than tidytext.

The main concept for tm way of managing documents is a Corpus

So, let’s load our data there and point to the text

corpus = Corpus(VectorSource(description_All$description))

Now that we have our corpus to analyse we need to clean it.

corpus <- tm_map(corpus, tolower) #convert to lowercase
corpus <- tm_map(corpus, removePunctuation) #remove punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english")) #remove stopwords

While the tidytext package has a very useful unnest_tokens function that not only tokenizes the text but also does some basic clean-up (including converting to lowercase), the tm package requires several steps.

Next is stemming: you can think of it as a grand-parent of tokenization… Well, not exactly. Linguists see a really big difference between the two, and add lemmatization to the discussion as well. Stemming usually “chops off” word endings to get a “unique” form of the word. Lemmatization is more sophisticated and relies on vocabulary and morphological analysis of the word. Tokenization is different: it segments the text, where a segment can be a word, a sentence, a line, etc. Let’s do it!

corpus <- tm_map(corpus, stemDocument)
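
To get a feel for what the stemmer does, here is a tiny sketch using the SnowballC stemmer that tm calls under the hood:

library(SnowballC)
wordStem(c("models", "secrets", "castings"), language = "english")   # returns the stems of each word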

Next, generating a document-term-matrix and removing rare words:

A document-term matrix (dtm) is a matrix that describes the frequencies of words in documents: rows are documents and columns are words (= terms).

dtm = DocumentTermMatrix(corpus)

The first ten terms from our matrix: dama, franc, french, girl, icon, jeann, kick, model, moment, new
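
If you want to reproduce that list yourself, the terms can be pulled straight out of the matrix:

Terms(dtm)[1:10]   # first ten terms in the document-term matrix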

A DTM can get big… very big. But not all terms in the dtm are that important; some just add noise and take processing time.

# Remove sparse terms
dtm = removeSparseTerms(dtm, 0.997) #0.997 is sparsity: the function removes only terms that are more sparse than 0.997

Let’s review the most frequent terms (those appearing at least 1,000 times):

findFreqTerms(dtm, 1000)
## character(0)

and finally create a data.frame for further analysis

labeledTerms = as.data.frame(as.matrix(dtm))

labeledTerms = labeledTerms[rowSums(abs(labeledTerms)) != 0,]

Now let’s have a look at what we can find with LDA. We need to specify the number of topics - it is hard to do, but we need to start with something. We will look at approaches for estimating the number of topics more accurately later. So? We need a number…

news_lda <- LDA(labeledTerms, k = 5, control = list(seed = 13))# set a seed so that the output of the model is predictable

We are back to the tidytext package to make the output more readable

news_topics <- tidy(news_lda, matrix = "beta")

news_top_terms <- news_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

news_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() + theme_bw()
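
The beta matrix gives the per-topic word probabilities. The per-document topic proportions described earlier live in the gamma matrix, which can be tidied the same way (a sketch):

news_documents <- tidy(news_lda, matrix = "gamma")

news_documents %>%
  group_by(document) %>%
  top_n(1, gamma) %>%   # the most likely topic for each description
  ungroup() %>%
  head()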

Now we need to look at the terms in each topic to see what topic they could “make”… You rarely get the number of topics right on the first run!

There are ways to mathematically estimate the number of topics. They are quite advanced and take time and resources to run. I ran one for us using the ldatuning package, and here are the results (ldaResults). The function is indeed resource intensive: it took half an hour to process this “small” dataset.

The function to use is FindTopicsNumber, and it is easier to work with its results on a plot:

result <- FindTopicsNumber(
  labeledTerms,
  topics = seq(from = 2, to = 80, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 77),
  mc.cores = 2L,
  verbose = TRUE
)

FindTopicsNumber_plot(result)